@dimitri-yatsenko dimitri-yatsenko commented Jan 12, 2026

Summary

Introduces the <npy@> codec for schema-addressed NumPy array storage with lazy loading, and refactors hash-addressed storage to use path-based retrieval.

Key Features

NpyCodec (<npy@>)

  • Lazy loading: Inspect array shape and dtype without downloading
  • Memory mapping: Random access to large arrays via mmap_mode
  • NumPy integration: Transparent array operations via __array__ protocol
  • Safe bulk fetch: Returns NpyRef objects instead of downloading all arrays
  • Portable format: Standard .npy files readable by NumPy, MATLAB, etc.
  • Schema-addressed: Paths derived from primary key ({schema}/{table}/{pk}/{attr}.npy)
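The PR does not show how NpyRef obtains shape and dtype without fetching the array, so the following is only a sketch of one way to do it: the .npy format puts a small text header in front of the data, and reading that header alone is enough. The names `read_npy_header` and `make_npy_bytes` are hypothetical and are not part of DataJoint or NumPy; the demo builds a version 1.0 file by hand so it runs with the standard library only.

```python
import ast
import io
import struct

MAGIC = b"\x93NUMPY"


def read_npy_header(stream):
    """Return (shape, descr, fortran_order), consuming only the header bytes."""
    if stream.read(6) != MAGIC:
        raise ValueError("not a .npy file")
    major, _minor = stream.read(2)
    # Version 1.0 stores the header length in 2 bytes; 2.0 and 3.0 use 4 bytes.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    else:
        (header_len,) = struct.unpack("<I", stream.read(4))
    # The header is a Python dict literal padded with spaces and a newline.
    header = ast.literal_eval(stream.read(header_len).decode("latin1"))
    return header["shape"], header["descr"], header["fortran_order"]


def make_npy_bytes(values):
    """Demo helper: build a version 1.0 .npy payload for a 1-D float64 array."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    prefix_len = len(MAGIC) + 2 + 2  # magic, version, 2-byte header length
    # Pad so that the data section starts on a 64-byte boundary.
    pad = -(prefix_len + len(header) + 1) % 64
    header = header + " " * pad + "\n"
    body = struct.pack("<%dd" % len(values), *values)
    return (MAGIC + bytes([1, 0]) + struct.pack("<H", len(header))
            + header.encode("latin1") + body)


payload = make_npy_bytes([0.0] * 1000)
stream = io.BytesIO(payload)
shape, descr, fortran_order = read_npy_header(stream)
print(shape, descr, f"{stream.tell()} of {len(payload)} bytes read")
```

Against a remote store, the same idea would apply to a ranged read of the first few hundred bytes; whether the codec does this or caches the metadata elsewhere is not stated in the PR.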

Hash Registry Refactoring

  • Path-based retrieval: Full path stored in metadata, used directly for retrieval
  • Config-change protection: Stored paths guard against subfolding/structure changes
  • Per-schema isolation: Hash paths include schema name (_hash/{schema}/{hash})
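To illustrate why storing the full path guards against configuration changes, here is a toy in-memory sketch. `HashStore`, its methods, and the metadata keys are hypothetical stand-ins, not the hash_registry API, and a hex digest is used in place of the Base32 token for brevity.

```python
import hashlib


class HashStore:
    """Toy store; the dict stands in for a filesystem or object store."""

    def __init__(self, subfolding=()):
        self.subfolding = subfolding  # e.g. (2, 2) gives ab/cd/<hash>
        self.objects = {}

    def build_path(self, schema, digest):
        folds, pos = [], 0
        for width in self.subfolding:
            folds.append(digest[pos:pos + width])
            pos += width
        return "/".join(["_hash", schema, *folds, digest])

    def put(self, schema, data):
        digest = hashlib.md5(data).hexdigest()
        path = self.build_path(schema, digest)
        self.objects[path] = data
        # The metadata kept with the table row carries the full path.
        return {"hash": digest, "path": path, "size": len(data)}

    def get(self, metadata):
        # Retrieval uses the stored path and never rebuilds it from config.
        return self.objects[metadata["path"]]


store = HashStore(subfolding=(2, 2))
meta = store.put("lab", b"spikes")
store.subfolding = ()  # configuration changes after the write
recovered = store.get(meta)
regenerated = store.build_path("lab", meta["hash"])
print(meta["path"], regenerated)
```

Had retrieval regenerated the path from the current configuration, it would have looked in the wrong place after the subfolding change.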

Codec Types

| Codec | Store | Description |
|---|---|---|
| <blob> | In-table | DataJoint serialization of Python objects |
| <blob@store> | Hash-addressed | Large blobs, deduplicated by hash |
| <attach@store> | Hash-addressed | File attachments, deduplicated by hash |
| <npy@store> | Schema-addressed | NumPy arrays with lazy loading ← this PR |
| <object@store> | Schema-addressed | Python objects, path from primary key |

Plugin codecs (separate packages, coming soon):

  • <zarr@store> - Zarr arrays
  • <tiff@store> - TIFF images
  • <parquet@store> - Parquet tables

Addressing Schemes

| Scheme | Path Derived From | Deduplication |
|---|---|---|
| Hash-addressed | Content hash (MD5→Base32) | Yes (per-schema) |
| Schema-addressed | Primary key | No |
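The two schemes can be contrasted with a short sketch. Both helper functions are hypothetical; the lowercase, unpadded Base32 token and the `name=value` rendering of primary-key attributes are assumptions made for illustration, since the PR gives only the path templates.

```python
import base64
import hashlib


def hash_addressed_path(schema, data):
    """Path depends only on the content, so identical bytes share one object."""
    digest = hashlib.md5(data).digest()
    token = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return f"_hash/{schema}/{token}"


def schema_addressed_path(schema, table, primary_key, attr, ext=".npy"):
    """Path depends only on the row identity, so every row gets its own object."""
    pk_part = "/".join(f"{name}={value}" for name, value in primary_key.items())
    return f"{schema}/{table}/{pk_part}/{attr}{ext}"


same_a = hash_addressed_path("lab", b"payload")
same_b = hash_addressed_path("lab", b"payload")
other_schema = hash_addressed_path("lab2", b"payload")
row_path = schema_addressed_path(
    "lab", "neuron", {"session_id": 1, "neuron_id": 7}, "activity"
)
print(same_a == same_b, same_a == other_schema)
print(row_path)
```

Identical content maps to one path within a schema but to a different path in another schema, which matches the "Yes (per-schema)" entry in the table.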

Usage

@schema
class Neuron(dj.Imported):
    definition = """
    -> Session
    neuron_id : int16
    ---
    activity : <npy@store>    # Lazy-loading array
    """

# Fetch returns NpyRef, not the array
ref = (Neuron & key).fetch1('activity')
print(ref.shape)      # (1000,) - no download
print(ref.dtype)      # float64 - no download

# Load when ready
array = ref.load()

# Memory-mapped for large arrays
mapped = ref.load(mmap_mode='r')
chunk = mapped[100:200]  # Only reads needed portion

Changes

New:

  • hash_registry.py - Refactored from content_registry.py with path-based storage
  • SchemaCodec - Abstract base class for schema-addressed codecs
  • NpyRef - Lazy reference with metadata access
  • NpyCodec - Codec implementation using .npy format

Refactoring:

  • ObjectCodec now inherits from SchemaCodec
  • Renamed is_external → is_store throughout codebase
  • hash_registry functions use stored paths for retrieval
  • gc.py updated to work with paths instead of hashes

Test Plan

  • All 643 tests pass
  • Unit tests for NpyRef metadata and mmap_mode
  • Integration tests for roundtrip encode/decode
  • Integration tests for lazy loading and caching

Documentation

See datajoint-docs (docs-2.0-migration branch).


Co-Authored-By: Claude Code [email protected]

Add migrate_external() and migrate_filepath() to datajoint.migrate module
for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides
rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns
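
The staged strategy can be pictured with an in-memory analogue in which each row is a dict. `migrate_column` and the JSON layout produced by `to_json` are hypothetical and only mirror the `dry_run` and `finalize` flags named above; the real functions operate on database columns.

```python
import json


def migrate_column(rows, column, convert, dry_run=True, finalize=False):
    """Return the number of rows affected; modify rows only when dry_run is False."""
    new, old = f"{column}_v2", f"{column}_v1"
    if not finalize:
        # Steps 1-2: add the _v2 value next to the untouched original.
        pending = [row for row in rows if new not in row]
        if not dry_run:
            for row in pending:
                row[new] = json.dumps(convert(row[column]))
        return len(pending)
    # Step 4: swap names once the caller has verified the new values (step 3).
    ready = [row for row in rows if new in row]
    if not dry_run:
        for row in ready:
            row[old] = row.pop(column)
            row[column] = row.pop(new)
    return len(ready)


def to_json(ref):
    # Assumed target layout, for illustration only.
    return {"hash": ref, "store": "external"}


rows = [{"id": 1, "signal": "abc123"}]
planned = migrate_column(rows, "signal", to_json)            # dry run, no change
untouched = dict(rows[0])
migrate_column(rows, "signal", to_json, dry_run=False)       # old and new coexist
coexisting = dict(rows[0])
migrate_column(rows, "signal", to_json, dry_run=False, finalize=True)
print(rows[0])
```

Until the finalize step runs, the original column is never modified, which is what makes rollback possible.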

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions github-actions bot added enhancement Indicates new improvements feature Indicates new features labels Jan 12, 2026
Implement the `<npy@>` codec for schema-addressed numpy array storage:

- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17

Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
dimitri-yatsenko and others added 7 commits January 12, 2026 16:29
The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema,
_table, _field, and primary key values to construct schema-addressed
storage paths. Previously, key=None was passed, resulting in
"unknown/unknown" paths.

Now builds proper context dict from table metadata and row values,
enabling navigable paths like:
  {schema}/{table}/objects/{pk_path}/{attribute}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Merge PR #1330 (blob preview display) into feature/npy-codec.
Bump version from 2.0.0a17 to 2.0.0a18.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Address reviewer feedback from PR #1330: attr should never be None
since field_name comes from heading.names. Raising an error surfaces
bugs immediately rather than silently returning a misleading placeholder.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…orage

Major changes to hash-addressed storage model:
- Rename content_registry.py → hash_registry.py for clarity
- Always store full path in metadata (protects against config changes)
- Use stored path directly for retrieval (no path regeneration)
- Add delete_path() as primary function, deprecate delete_hash()
- Add get_size() as primary function, deprecate get_hash_size()
- Update gc.py to work with paths instead of hashes
- Update builtin_codecs.py HashCodec to use new API

This design enables seamless migration from v0.14:
- Legacy data keeps old paths in metadata
- New data uses new path structure
- GC compares stored paths against filesystem
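
A rough sketch of what comparing stored paths against the filesystem could look like: walk the store once, record sizes along the way, then take set differences. `find_orphans` is a hypothetical helper and is not the code in gc.py.

```python
import os
import tempfile


def find_orphans(root, referenced_paths):
    """Return ({orphan_path: size_in_bytes}, set_of_missing_paths)."""
    on_disk = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            on_disk[rel] = os.path.getsize(full)  # size collected during the walk
    referenced = set(referenced_paths)
    orphans = {path: on_disk[path] for path in set(on_disk) - referenced}
    missing = referenced - set(on_disk)
    return orphans, missing


with tempfile.TemporaryDirectory() as root:
    files = {"_hash/lab/aaa": b"12345", "legacy/ab/cd/bbb": b"123"}
    for rel, data in files.items():
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as fh:
            fh.write(data)
    # Paths as they would be read from row metadata, old and new layouts alike.
    orphans, missing = find_orphans(root, ["_hash/lab/aaa", "_hash/lab/zzz"])

print(orphans, missing)
```

Because the comparison works on whole paths, legacy and new path layouts are handled the same way and no path needs to be regenerated from a hash.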

Co-Authored-By: Claude Opus 4.5 <[email protected]>
dimitri-yatsenko and others added 2 commits January 13, 2026 14:05
- Remove uuid_from_buffer from hash.py (dead code)
- connection.py now uses hashlib.md5().hexdigest() directly
- Update test_hash.py to test key_hash instead

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove dead code that was only tested but never used in production:
- hash_exists (gc uses set operations on paths)
- delete_hash (gc uses delete_path directly)
- get_size (gc collects sizes during walk)
- get_hash_size (wrapper for get_size)

Remaining API: compute_hash, build_hash_path, get_store_backend,
get_store_subfolding, put_hash, get_hash, delete_path

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dimitri-yatsenko dimitri-yatsenko merged commit 471b8a9 into pre/v2.0 Jan 13, 2026
7 of 8 checks passed
@dimitri-yatsenko dimitri-yatsenko deleted the feature/npy-codec branch January 13, 2026 23:04